N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus
Internal identifier: 000D58 (Main/Exploration); previous: 000D57; next: 000D59
Authors: Artur Šilić [Croatia]; Jean-Hugues Chauchat [France]; Bojana Dalbelo Bašić [Croatia]; Annie Morin [France]
Source:
- Lecture Notes in Computer Science [0302-9743]; 2007.
Abstract
In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that although n-grams achieve classifier performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.
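As an illustration of the two kinds of features the abstract contrasts, the following is a minimal sketch, not the authors' implementation. It assumes overlapping character n-grams of length 3 and plain whitespace tokenization; the function names `char_ngrams` and `word_features` and the sample sentence are hypothetical, and this record does not state which n-gram lengths or tokenization the paper actually used.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after lowercasing and
    collapsing whitespace (assumed preprocessing, for illustration only)."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def word_features(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens, with no
    morphological normalization applied."""
    return Counter(text.lower().split())


# Hypothetical sample sentence, not taken from the corpus.
sample = "text classification of parallel texts"
ngrams = char_ngrams(sample, n=3)
words = word_features(sample)

# Inflected forms such as "text" and "texts" are distinct word features,
# but they share character n-grams such as "tex" and "ext".
print(len(ngrams), len(words))
```

Even on a single short sentence the n-gram representation yields more distinct features than the word representation, which gives an intuition for why n-grams may cost more time and memory while still capturing overlap between inflected forms.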
Url:
DOI: 10.1007/978-3-540-77002-2_56
Affiliations:
- Croatia, France
- Auvergne-Rhône-Alpes, Rhône-Alpes, Région Bretagne
- Bron, Rennes
- Université de Rennes 1
Links to previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 001D02
- to stream Istex, to step Curation: 001B82
- to stream Istex, to step Checkpoint: 000773
- to stream Main, to step Merge: 000D71
- to stream Main, to step Curation: 000D58
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct:series"><teiHeader><fileDesc><titleStmt><title xml:lang="en">N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus</title>
<author><name sortKey="Sili, Artur" sort="Sili, Artur" uniqKey="Sili A" first="Artur" last="Šili">Artur Šili</name>
</author>
<author><name sortKey="Chauchat, Jean Hugues" sort="Chauchat, Jean Hugues" uniqKey="Chauchat J" first="Jean-Hugues" last="Chauchat">Jean-Hugues Chauchat</name>
</author>
<author><name sortKey="Dalbelo Basi, Bojana" sort="Dalbelo Basi, Bojana" uniqKey="Dalbelo Basi B" first="Bojana" last="Dalbelo Baši">Bojana Dalbelo Baši</name>
</author>
<author><name sortKey="Morin, Annie" sort="Morin, Annie" uniqKey="Morin A" first="Annie" last="Morin">Annie Morin</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB</idno>
<date when="2007" year="2007">2007</date>
<idno type="doi">10.1007/978-3-540-77002-2_56</idno>
<idno type="url">https://api.istex.fr/document/E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001D02</idno>
<idno type="wicri:Area/Istex/Curation">001B82</idno>
<idno type="wicri:Area/Istex/Checkpoint">000773</idno>
<idno type="wicri:doubleKey">0302-9743:2007:Sili A:n:grams:and</idno>
<idno type="wicri:Area/Main/Merge">000D71</idno>
<idno type="wicri:Area/Main/Curation">000D58</idno>
<idno type="wicri:Area/Main/Exploration">000D58</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus</title>
<author><name sortKey="Sili, Artur" sort="Sili, Artur" uniqKey="Sili A" first="Artur" last="Šili">Artur Šili</name>
<affiliation wicri:level="1"><country xml:lang="fr">Croatie</country>
<wicri:regionArea>University of Zagreb, Department of Electronics, Microelectronics, Computer and Intelligent Systems, KTLab, Unska 3, 10000 Zagreb</wicri:regionArea>
<wicri:noRegion>10000 Zagreb</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Croatie</country>
</affiliation>
</author>
<author><name sortKey="Chauchat, Jean Hugues" sort="Chauchat, Jean Hugues" uniqKey="Chauchat J" first="Jean-Hugues" last="Chauchat">Jean-Hugues Chauchat</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>Université de Lyon 2, Faculté de Sciences Economique et de Gestion, Laboratoire Eric, 5 avenue Pierre Mendès France, 69676 Bron Cedex</wicri:regionArea>
<placeName><region type="region" nuts="2">Auvergne-Rhône-Alpes</region>
<region type="old region" nuts="2">Rhône-Alpes</region>
<settlement type="city">Bron</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
<author><name sortKey="Dalbelo Basi, Bojana" sort="Dalbelo Basi, Bojana" uniqKey="Dalbelo Basi B" first="Bojana" last="Dalbelo Baši">Bojana Dalbelo Baši</name>
<affiliation wicri:level="1"><country xml:lang="fr">Croatie</country>
<wicri:regionArea>University of Zagreb, Department of Electronics, Microelectronics, Computer and Intelligent Systems, KTLab, Unska 3, 10000 Zagreb</wicri:regionArea>
<wicri:noRegion>10000 Zagreb</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Croatie</country>
</affiliation>
</author>
<author><name sortKey="Morin, Annie" sort="Morin, Annie" uniqKey="Morin A" first="Annie" last="Morin">Annie Morin</name>
<affiliation wicri:level="4"><country xml:lang="fr">France</country>
<wicri:regionArea>Université de Rennes 1, IRISA, 35042 Rennes Cedex</wicri:regionArea>
<placeName><region type="region" nuts="2">Région Bretagne</region>
<settlement type="city">Rennes</settlement>
</placeName>
<orgName type="university">Université de Rennes 1</orgName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2007</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB</idno>
<idno type="DOI">10.1007/978-3-540-77002-2_56</idno>
<idno type="ChapterID">56</idno>
<idno type="ChapterID">Chap56</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that although n-grams achieve classifier performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.</div>
</front>
</TEI>
<affiliations><list><country><li>Croatie</li>
<li>France</li>
</country>
<region><li>Auvergne-Rhône-Alpes</li>
<li>Rhône-Alpes</li>
<li>Région Bretagne</li>
</region>
<settlement><li>Bron</li>
<li>Rennes</li>
</settlement>
<orgName><li>Université de Rennes 1</li>
</orgName>
</list>
<tree><country name="Croatie"><noRegion><name sortKey="Sili, Artur" sort="Sili, Artur" uniqKey="Sili A" first="Artur" last="Šili">Artur Šili</name>
</noRegion>
<name sortKey="Dalbelo Basi, Bojana" sort="Dalbelo Basi, Bojana" uniqKey="Dalbelo Basi B" first="Bojana" last="Dalbelo Baši">Bojana Dalbelo Baši</name>
<name sortKey="Dalbelo Basi, Bojana" sort="Dalbelo Basi, Bojana" uniqKey="Dalbelo Basi B" first="Bojana" last="Dalbelo Baši">Bojana Dalbelo Baši</name>
<name sortKey="Sili, Artur" sort="Sili, Artur" uniqKey="Sili A" first="Artur" last="Šili">Artur Šili</name>
</country>
<country name="France"><region name="Auvergne-Rhône-Alpes"><name sortKey="Chauchat, Jean Hugues" sort="Chauchat, Jean Hugues" uniqKey="Chauchat J" first="Jean-Hugues" last="Chauchat">Jean-Hugues Chauchat</name>
</region>
<name sortKey="Chauchat, Jean Hugues" sort="Chauchat, Jean Hugues" uniqKey="Chauchat J" first="Jean-Hugues" last="Chauchat">Jean-Hugues Chauchat</name>
<name sortKey="Morin, Annie" sort="Morin, Annie" uniqKey="Morin A" first="Annie" last="Morin">Annie Morin</name>
<name sortKey="Morin, Annie" sort="Morin, Annie" uniqKey="Morin A" first="Annie" last="Morin">Annie Morin</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D58 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000D58 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB |texte= N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus }}
This area was generated with Dilib version V0.6.32.